# A pro-government disinformation campaign on Indonesian Papua

## Overview
We provide four R scripts and an RMarkdown file, numbered sequentially, for the portion of the project coded in R. These scripts follow on from the Jupyter Notebooks (detailed in a separate README file) that were used to scrape, extract, and detect the language of tweets for the article _A pro-government disinformation campaign on Indonesian Papua_, published in _Harvard Kennedy School Misinformation Review_, October 2022, Volume 3, Issue 5. R was used for further data wrangling, analysis, and creating tables and figures.

## Software environment
These R scripts and the RMarkdown file were run in RStudio Server Version 1.4.1717 on a Melbourne Research Cloud virtual machine, provided by Research Computing Services at the University of Melbourne. The operating system for the virtual machine was Ubuntu 18.04.6 LTS.  

## Dependencies
Dependencies are managed using [renv](https://rstudio.github.io/renv/index.html) and specified in the file `renv.lock`. To recreate the required environment from this file, install the renv package (`install.packages("renv")`) and then run `renv::restore()`.

## Data
Due to restrictions from Twitter API v2 for Academic Research (https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases), we are only able to provide a severely limited version of the dataset used in this analysis.

## Order of execution
The scripts and RMarkdown file were run in the order of their numbering. No other scripts have to be run.
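
As a minimal sketch, the run order could be reproduced from an R session as shown below. It assumes the working directory is the folder containing the files and that each script finds its inputs relative to that folder; neither assumption is confirmed by this README. The file names are those listed in the table in this section.

```r
# Sketch only: assumes the working directory holds all five files
# and that each step can locate the outputs of the previous one.
source("1_wrangling.R")
source("2_creating_indo_dt.R")

# The RMarkdown file is knitted rather than sourced.
rmarkdown::render("3_summary_statistics.Rmd")

source("4_vertical_bands.R")
source("5_vblr_topics_and_author_stats.R")
```

Running the steps in a single session keeps any objects created by earlier scripts available to later ones, which may or may not be required here.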

| Script | Description |
|----------------------|---------------------------------------------------------------------------------------------------------------|
| 1_wrangling.R | Creates columns for the type of tweet, the dates tweets and accounts were created, a tweet URL, and columns to indicate which query or queries returned each tweet in the dataset. A set of 1,500 irrelevant tweets that mentioned an account named similarly to prominent independence activist Benny Wenda is also removed. | 
| 2_creating_indo_dt.R | Creates columns for the dates tweets and accounts were created using the Jakarta timezone, a column for the time of day of tweets, and a column that calculates the amount of engagement each tweet in the dataset received. The dataset is then filtered to remove any tweets not marked as Indonesian language, after which a column is created that counts the number of search terms that returned each tweet. | 
| 3_summary_statistics.Rmd | Creates Table 1, Figure 7, and calculates how many tweets were returned by more than one search term | 
| 4_vertical_bands.R | Demonstrates the concentration of tweets in the minutes starting 6:55 am and 8:00 am Jakarta time. Using quanteda and its sub-packages to run a Jaccard similarity test, in combination with igraph to generate components, the script shows that most of the tweets posted in these two minutes comprise sets of duplicate or near-duplicate tweets. | 
| 5_vblr_topics_and_author_stats.R | Calculates the proportion of tweets posted in the minutes starting 6:55 am and 8:00 am that discuss various topics, and generates samples of the authors of these tweets to enable observations about their account characteristics. The proportion of accounts that Twitter has suspended since we scraped our dataset is also calculated. | 
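
The near-duplicate detection described for `4_vertical_bands.R` can be illustrated with a small sketch. This is not the script itself: the toy texts, the object names, and the 0.8 similarity threshold are all assumptions made for illustration. It computes pairwise Jaccard similarity between documents with quanteda, links documents whose similarity meets the threshold, and uses igraph to find connected components, each of which is a candidate set of duplicates or near-duplicates.

```r
library(quanteda)
library(quanteda.textstats)
library(igraph)

# Toy texts, invented for illustration only (not from the dataset).
texts <- c(
  d1 = "the quick brown fox jumps",
  d2 = "the quick brown fox jumps!",
  d3 = "completely different words here"
)

# Tokenise without punctuation and build a document-feature matrix.
toks <- tokens(texts, remove_punct = TRUE)
dfm_texts <- dfm(toks)

# Pairwise Jaccard similarity between documents.
sim <- textstat_simil(dfm_texts, method = "jaccard", margin = "documents")
sim_mat <- as.matrix(sim)

# Link documents at or above the threshold (0.8 is an assumed value).
threshold <- 0.8
adj <- (sim_mat >= threshold) * 1
diag(adj) <- 0  # drop self-links

# Connected components group documents that are linked, directly or via others.
g <- graph_from_adjacency_matrix(adj, mode = "undirected")
comp <- components(g)

# List the documents in each component.
groups <- split(names(comp$membership), comp$membership)
print(groups)
```

Using components rather than pairwise matches means that if tweet A resembles B and B resembles C, all three fall into one set even when A and C are below the threshold. How strict the threshold should be depends on the data and is a judgment call.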
